
    Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque–Spanish ASR

    In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an extensive collection of Basque Parliament plenary sessions containing frequent code switching. Since session minutes are not exact transcriptions, only the most reliable speech segments are kept for training. To that end, we use phonetic similarity scores between nominal and recognized phone sequences. The process starts with baseline acoustic models trained on generic out-of-domain data, then iteratively updates the models with the extracted data and applies the updated models to refine the training dataset, until the observed improvement between two iterations becomes small enough. A development dataset, comprising five plenary sessions not used for training, has been manually audited for tuning and evaluation purposes. Cross-validation experiments (with 20 random partitions) have been carried out on the development dataset, using the baseline and the iteratively updated models. On average, the Word Error Rate (WER) reduces from 16.57% (baseline) to 4.41% (first iteration) and further to 4.02% (second iteration), corresponding to relative WER reductions of 73.4% and 8.8%, respectively. When considering only Basque segments, WER reduces on average from 16.57% (baseline) to 5.51% (first iteration) and further to 5.13% (second iteration), corresponding to relative WER reductions of 66.7% and 6.9%, respectively. As a result of this work, a new bilingual Basque–Spanish resource has been produced based on Basque Parliament sessions, including 998 h of training data (audio segments + transcriptions), a development set (17 h long) designed for tuning and evaluation under a cross-validation scheme, and a fully bilingual trigram language model. This work was partially funded by the Spanish Ministry of Science and Innovation (OPEN-SPEECH project, PID2019-106424RB-I00) and by the Basque Government under the general support program to research groups (IT-1704-22).
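
    The reliability filter described above lends itself to a short illustration. Below is a minimal Python sketch of the idea, assuming a normalized Levenshtein similarity between the nominal phone sequence (derived from the minutes) and the recognized one; the function names, the threshold value and the exact scoring are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch: keep a candidate training segment only when the phone
# sequence derived from the session minutes (nominal) is close enough to
# the phone sequence produced by the current acoustic model (recognized).

def phone_similarity(nominal: list[str], recognized: list[str]) -> float:
    """Normalized similarity in [0, 1] based on Levenshtein distance."""
    n, m = len(nominal), len(recognized)
    if n == 0 and m == 0:
        return 1.0
    # Standard dynamic-programming edit distance over phone symbols.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if nominal[i - 1] == recognized[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return 1.0 - prev[m] / max(n, m)

def filter_segments(segments, recognize, threshold=0.9):
    """Keep only segments whose nominal/recognized phone match is reliable.

    segments: iterable of (audio, nominal_phones) pairs.
    recognize: phone decoding with the current acoustic model.
    """
    kept = []
    for audio, nominal in segments:
        recognized = recognize(audio)
        if phone_similarity(nominal, recognized) >= threshold:
            kept.append((audio, nominal))
    return kept
```

    In the iterative scheme the paper describes, this filtering step would be re-run after each acoustic-model update, so the retained training set and the models refine each other until the gain between iterations is small.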

    Using cross-decoder phone coocurrences in phonotactic language recognition

    Phonotactic language recognizers are based on the ability of phone decoders to produce phone sequences containing acoustic, phonetic and phonological information, which is partially dependent on the language. Input utterances are decoded and then scored by means of models for the target languages. Commonly, various decoders are applied in parallel and fused at the score level. A kind of complementarity effect is expected when fusing scores, since each decoder is assumed to extract different (and complementary) information from the input utterance. This assumption is supported by the performance improvements attained when fusing systems. However, decodings are processed in a fully uncoupled way, so their time alignment (and the information that may be extracted from it) is completely lost. In this paper, a simple approach is proposed which takes time alignment information into account by considering cross-decoder phone co-occurrences at the frame level. To evaluate the approach, a selection of open software (BUT front-end and phone decoders, the SRILM toolkit, LIBSVM, FoCal) is used, and experiments are carried out on the NIST LRE2007 database. Adding phone co-occurrences to the baseline phonotactic systems provides slight performance improvements, revealing the potential benefit of using cross-decoder dependencies for language modeling.
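
    To make the proposed features concrete, here is a minimal Python sketch of frame-level cross-decoder phone co-occurrence counting, under the assumption that each decoder's output has already been expanded to one phone label per frame (e.g., 10 ms frames); the names and the exact feature definition are assumptions, not the paper's code.

```python
# Hedged sketch: count which phone labels different decoders hypothesize
# at the same frame, capturing the time-alignment information that is
# lost when decodings are scored independently.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(frame_labels: list[list[str]]) -> Counter:
    """frame_labels[d][t] is the phone hypothesized by decoder d at frame t.

    Returns counts of cross-decoder label pairs observed per frame.
    """
    num_frames = min(len(seq) for seq in frame_labels)
    counts = Counter()
    for t in range(num_frames):
        # One co-occurrence event per pair of decoders at each frame.
        for d1, d2 in combinations(range(len(frame_labels)), 2):
            counts[(d1, frame_labels[d1][t], d2, frame_labels[d2][t])] += 1
    return counts
```

    The normalized counts could then serve as an additional feature vector alongside the usual phone n-gram statistics, for instance as input to an SVM-based phonotactic system.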

    Query-by-Example Spoken Term Detection ALBAYZIN 2012 evaluation: overview, systems, results and discussion

    The final publication is available at Springer via http://dx.doi.org/10.1186/1687-4722-2013-23. Query-by-Example Spoken Term Detection (QbE STD) aims at retrieving data from a speech repository given an acoustic query containing the term of interest as input. It has been receiving much interest due to the high volume of information stored in audio or audiovisual format. QbE STD differs from automatic speech recognition (ASR) and keyword spotting (KWS)/spoken term detection (STD), since ASR is interested in all the terms/words that appear in the speech signal, and KWS/STD relies on a textual transcription of the search term to retrieve the speech data. This paper presents the systems submitted to the ALBAYZIN 2012 QbE STD evaluation, held as part of the ALBAYZIN 2012 evaluation campaign within the context of the IberSPEECH 2012 Conference. The evaluation consists of retrieving the speech files that contain the input queries, indicating their start and end timestamps within the appropriate speech file. Evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from MAVIR workshops, which amount to about 7 h of speech in total. We present the database, the evaluation metric and the submitted systems, along with all results and some discussion. Four different research groups took part in the evaluation. The results show the difficulty of this task, and the limited performance indicates there is still a lot of room for improvement. The best result is achieved by a dynamic time warping-based search over Gaussian posteriorgrams/posterior phoneme probabilities. This paper also compares the systems, aiming at establishing the best technique for dealing with this difficult task and at defining promising directions for this relatively novel task.
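    The best-performing approach mentioned above, a DTW-based search over posteriorgrams, can be illustrated with a generic subsequence DTW sketch in Python; this is a textbook formulation using the common negative-log inner-product frame distance, not the submitted system's actual implementation.

```python
# Hedged sketch: find the best alignment of a query posteriorgram anywhere
# inside an utterance posteriorgram (the match may start and end at any
# utterance frame, unlike plain DTW).
import numpy as np

def subsequence_dtw_cost(query: np.ndarray, utterance: np.ndarray) -> float:
    """query: (Tq, P) and utterance: (Tu, P), rows are posterior vectors.

    Returns the lowest length-normalized alignment cost of the query
    inside the utterance.
    """
    eps = 1e-10
    # Frame distance: -log of the dot product of posterior vectors.
    dist = -np.log(np.clip(query @ utterance.T, eps, None))  # (Tq, Tu)
    Tq, Tu = dist.shape
    D = np.full((Tq, Tu), np.inf)
    D[0, :] = dist[0, :]  # the match may start at any utterance frame
    for i in range(1, Tq):
        D[i, 0] = D[i - 1, 0] + dist[i, 0]
        for j in range(1, Tu):
            D[i, j] = dist[i, j] + min(D[i - 1, j],      # stretch query
                                       D[i, j - 1],      # stretch utterance
                                       D[i - 1, j - 1])  # diagonal step
    return float(D[-1, :].min() / Tq)  # may also end at any utterance frame
```

    Ranking utterances (or sliding windows within them) by this cost, and thresholding it, yields the detection decisions and timestamps the evaluation asks for.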

    An Overview of the IberSpeech-RTVE 2022 Challenges on Speech Technologies

    Evaluation campaigns provide a common framework with which the progress of speech technologies can be effectively measured. The aim of this paper is to present a detailed overview of the IberSpeech-RTVE 2022 Challenges, which were organized as part of the IberSpeech 2022 conference under the ongoing series of Albayzin evaluation campaigns. In the 2022 edition, four challenges were launched: (1) speech-to-text transcription; (2) speaker diarization and identity assignment; (3) text and speech alignment; and (4) search on speech. Databases covering different domains (e.g., broadcast news, conference talks, parliament sessions) were released for those challenges. The submitted systems also cover a wide range of speech processing methods, including hidden Markov model-based approaches, end-to-end neural network-based methods and hybrid approaches. This paper describes the databases, the tasks and the performance metrics used in the four challenges. It also provides the most relevant features of the submitted systems and briefly presents and discusses the obtained results. Despite employing state-of-the-art technology, the relatively poor performance attained in some of the challenges reveals that there is still room for improvement. This encourages us to carry on with the Albayzin evaluation campaigns in the coming years. This work was partially supported by Radio Televisión Española through the RTVE Chair at the University of Zaragoza, and by Red Temática en Tecnologías del Habla (RED2022-134270-T), funded by AEI (Ministerio de Ciencia e Innovación); it was also partially funded by the European Union's Horizon 2020 research and innovation program under Marie Skłodowska-Curie Grant 101007666, in part by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU"/PRTR under Grants PDC2021-120846C41 and PID2021-126061OB-C44, and in part by the Government of Aragon (Grant Group T3623R); it was also partially funded by the Spanish Ministry of Science and Innovation (OPEN-SPEECH project, PID2019-106424RB-I00), by the Basque Government under the general support program to research groups (IT-1704-22), and by projects RTI2018-098091-B-I00 and PID2021-125943OB-I00 (Spanish Ministry of Science and Innovation and ERDF).

    Cerámica bajo la escalera

    40 p. : ill., col. Catalogue of the exhibition held at the Sala la Escalera, Facultad de Bellas Artes de Cuenca, from 22 June to 3 July 2015. UPV/EHU, UCL.

    Antecedentes y desarrollo de los sistemas actuales de reconocimiento automático del habla

    A study of the historical evolution of continuous automatic speech recognition systems is presented. These systems fundamentally involve two stages: synthesis and analysis. A synthesizer is an instrument capable of reproducing the human voice, whereas an analyzer is a system capable of recognizing it. Highly developed synthesis systems exist today; analysis, however, remains an open problem of great importance in the field of artificial intelligence. Finally, a series of considerations on future prospects in this area is offered.

    Cuadernos de pedagogía

    Abstract based on the one provided by the journal. Working on a topic as close to children as their pets offers the teacher many advantages: it reveals personal aspects of the pupils, makes it possible to foster respect for animals and, in addition, favours motivation and disinhibition. A didactic proposal about pets, carried out with children in the first year of primary school, is presented. Through participatory, playful and imaginative activities, it seeks to convey the importance of caring for and protecting animals and to instil respect for one's own oral expression and that of one's classmates. Special attention has been paid to pupils with some kind of difficulty, showing that an experience conceived for them can also be useful for the rest of the group.

    Reconocimiento de la Lengua en Albayzin 2010 LRE utilizando características PLLR

    Phone Log-Likelihood Ratios (PLLR) have recently been proposed as alternative features to MFCC-SDC for iVector-based Spoken Language Recognition (SLR). In this paper, PLLR features are first described, and then further evidence of their usefulness for SLR tasks is provided, with a new set of experiments on the Albayzin 2010 LRE dataset, which features wide-band multi-speaker TV broadcast speech in six languages: Basque, Catalan, Galician, Spanish, Portuguese and English. iVector systems built using PLLR features, computed by means of three open-source phone decoders, achieved significant relative improvements with regard to the phonotactic and MFCC-SDC iVector systems in both clean and noisy speech conditions. Fusions of the PLLR systems with the phonotactic and/or the MFCC-SDC iVector systems led to further improved performance, revealing that PLLR features provide complementary information in both cases. This work has been supported by the University of the Basque Country under grant GIU10/18 and project US11/06, and by the Government of the Basque Country under the SAIOTEK program (project S-PE12UN55). M. Diez is supported by a research fellowship from the Department of Education, Universities and Research of the Basque Country Government.
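
    For illustration, the log-odds transformation that the PLLR literature describes can be sketched in a few lines of Python; the variable names and the clipping constant are illustrative assumptions.

```python
# Hedged sketch: derive PLLR features from frame-level phone posteriors
# via the per-phone log-odds (logit) transformation.
import numpy as np

def pllr(posteriors: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """posteriors: (T, P) matrix, each row a phone posterior distribution.

    Returns the (T, P) matrix of phone log-likelihood ratios.
    """
    p = np.clip(posteriors, eps, 1.0 - eps)  # avoid log(0) at the extremes
    return np.log(p) - np.log(1.0 - p)       # log(p / (1 - p)) per phone, per frame
```

    The resulting frame-level vectors would then feed an iVector extractor in place of MFCC-SDC features, which is what the fusion experiments above compare against.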